test(document): de-flake redundancy search test (ci:part2)#8
Increase timing tolerance for known flaky scenarios and bound the late-drop wait to avoid callback-timing hangs in CI.
User-Reported Context (verbatim)
Investigation Report (verbatim from investigation)
Root Cause (ci:part2 flake)
The failing test `index > operations > search > redundancy > can search while keeping minimum amount of replicas` in `packages/programs/data/document/document/test/index.spec.ts` was asserting immediate completeness (`collected.length === count`) while the system is still rebalancing/syncing. In CI, a distributed `index.search(fetch=count)` can transiently short-read due to timing (indexing lag and/or missed remote RPC responses), producing the familiar signature: `Failed to collect all messages X < Y. Log lengths: [...]`

CI History (same signature)
I scanned all available CI workflow runs for dao-xyz/peerbit, including rerun attempts (`/runs/{id}/attempts/{n}/jobs`). Only 7 matching occurrences are available (so there aren't 30 accessible to list from the currently retained logs).

Commonality: `collected` is always equal or close to one of the printed per-peer log lengths, consistent with a "partial view at that instant".

Deterministic Confirmation
Two deterministic ways to recreate “short reads” (thus validating why the old test flakes):
- `MissingResponsesError` under `remote.throwOnMissing=true`, while the default/best-effort path can return `< fetch` results without throwing.

Fix Implemented
I brought in the existing de-flake change (from upstream commit e1f77d6…) onto a new branch and opened a PR. In `packages/programs/data/document/document/test/index.spec.ts` it:

- Wraps the completeness assertion in `waitForResolved(...)` (timeout `90_000`, interval `200` ms) and bumps the mocha timeout to `120_000`.
- Lowers `count` from `1000` → `600` to reduce load/timing sensitivity.
- De-flakes "pending still counts buffered in-order results after late drop" by racing with a short delay and increasing the timeout.
`PEERBIT_TEST_SESSION=mock pnpm run test:ci:part-2` passes on `fix/ci-part2-redundancy-flake`.

Upstream PR (dao-xyz/peerbit)
Log
`debugging-plan.md` (append-only; included in this PR).

Fork PR notes:
Opened on the fork (Faolain/peerbit) so you can validate it in your CI.

How To Confirm (tests)
- Single test: `PEERBIT_TEST_SESSION=mock pnpm --filter @peerbit/document test -- --grep "can search while keeping minimum amount of replicas"`
- ci:part2: `PEERBIT_TEST_SESSION=mock pnpm run test:ci:part-2`
- Stress loop: `for i in {1..25}; do echo "run $i"; PEERBIT_TEST_SESSION=mock pnpm --filter @peerbit/document test -- --grep "can search while keeping minimum amount of replicas" || break; done`

Optional deterministic demo (local only): you can simulate short reads by running a query with a very small `remote.timeout` (e.g. 200 ms) and forcing one peer's pubsub publish to delay, then observe:

- `remote.throwOnMissing=true` -> `MissingResponsesError`
- `< fetch` results otherwise

New: Local Stress-Loop Results (2026-02-06)
The flake can be reproduced locally with a tight loop.
- `origin/master`: FAIL at iteration 11/25 with `Failed to collect all messages 997 < 1000. Log lengths: [997,102,578]`
- This branch (`fix/ci-part2-redundancy-flake`): FAIL at iteration 17/25 with `Failed to collect all messages 317 < 600. Log lengths: [286,55,317]` (timed out inside `waitForResolved(...)`)

This means the change here clearly fixes the "assert immediately" aspect (so it often avoids the fast failure), but under stress there still appear to be scenarios where full convergence does not happen within the current retry window.
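The "optional deterministic demo" above can be approximated without a running network. This sketch assumes (it is not the real peerbit RPC API) that a distributed search collects per-peer responses until `remote.timeout`, then either throws `MissingResponsesError` (`throwOnMissing=true`) or returns the partial, best-effort result:

```typescript
// Assumed semantics, not the real peerbit API: collect per-peer results
// until a deadline; a slow peer's response is simply missing.
class MissingResponsesError extends Error {}

const delay = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Three "peers"; the last one responds too slowly (a delayed publish).
const peers = [
    { delayMs: 10, results: 250 },
    { delayMs: 20, results: 200 },
    { delayMs: 500, results: 150 },
];

async function search(
    fetch: number,
    remote: { timeout: number; throwOnMissing: boolean },
): Promise<number> {
    let collected = 0;
    let missing = 0;
    await Promise.all(
        peers.map(async (peer) => {
            // Race each peer's response against the query deadline.
            const res = await Promise.race([
                delay(peer.delayMs).then(() => peer.results),
                delay(remote.timeout).then(() => "missing" as const),
            ]);
            if (res === "missing") missing += 1;
            else collected += res;
        }),
    );
    if (missing > 0 && remote.throwOnMissing) {
        throw new MissingResponsesError(`${missing} peer(s) timed out`);
    }
    return Math.min(collected, fetch); // best effort: may be < fetch
}

(async () => {
    // Best-effort path: transiently returns fewer than `fetch` results.
    console.log(await search(600, { timeout: 200, throwOnMissing: false })); // 450

    // Strict path: the same timing surfaces as MissingResponsesError.
    try {
        await search(600, { timeout: 200, throwOnMissing: true });
    } catch (err) {
        console.log(err instanceof MissingResponsesError); // true
    }
})();
```

This matches the observed failure mode: the partial total always equals a sum of the per-peer views that happened to arrive in time, which is why `collected` tracks the printed log lengths.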